The Twofish Encryption Algorithm: A 128-Bit Block Cipher The Twofish Encryption Algorithm: A 128-Bit Block Cipher
by Bruce Schneier ; John Kelsey ; Doug Whiting ; David Wagner ; Chris Hall ; Niels Ferguson
Wiley Computer Publishing, John Wiley & Sons, Inc.
ISBN: 0471353817   Pub Date: 03/01/99
  

Previous Table of Contents Next


5.1.1 Keying Options

We implemented four different keying options. All of our keying options pre-compute Ki for i = 0,...,39 and use 160 bytes of RAM to store these constants. The differences occur in the way the function g is implemented. There are several other possible keying options, each with slightly different setup/throughput tradeoffs, but the examples listed below are representative of the range of possibilities.

Full Keying.
This option performs the full key precomputations. Using 4 Kbytes of table space, each S-box is expanded to a 8-by-32-bit table that combines both the S-box lookup and the multiply by the column of the MDS matrix. Using this option, a computation of g consists of four table lookups, and three XORs. Encryption and decryption speeds are constant regardless of key size.
Partial Keying.
For applications where few blocks are encrypted with a single key, it may not make sense to build the complete key schedule. The partial keying option precomputes the four S-boxes in 8-by-8-bit tables, and uses four fixed 8-by-32-bit MDS tables to perform the MDS multiply. This reduces the key-schedule table space to 1 Kbyte. For each byte, the last of the q-box lookups is in fact incorporated into the MDS table, so only k of the q-boxes are incorporated into the 8-by-8-bit S-box table that is built by the key schedule. Encryption and decryption speeds are again constant regardless of key size.
Minimal Keying.
For applications where very few blocks are encrypted with a single key, there is a further possible optimization. Compared to partial keying, one less layer of q-boxes is precomputed into the S-box table, and the remaining q-box is done during the encryption. For the 128-bit key this is particularly efficient, as precomputing the S-boxes now consists of copying the table of the appropriate q-box and XORing it with a constant (which can be done word-by-word instead of byte-by-byte). This option uses a 1 Kbyte table to store the partially precomputed S-boxes. The necessary key bytes from S are of course precomputed, as they are needed in every round.
Zero Keying.
The zero keying option does not precompute any of the S-boxes, and thus needs no extra tables. Instead, every entry is computed on the fly. The key setup time consists purely of computing the Ki values and S. For an application that cannot have any key setup time, the time it takes to encrypt one block is the sum of the key setup time and encryption time for the zero keying option.
Compiled.
In this option, available only in assembly language, the subkey constants are directly embedded into a key-specific copy of the code, saving memory fetches and allowing the use of the Pentium LEA opcode to perform both the PHT and the subkey addition all in a single clock cycle. Some additional setup time is required, as well as about an extra 5000 bytes of memory to hold the “compiled” code, but this option allows the fastest execution time of 258 clocks per block on the Pentium Pro. The setup time for a Pentium more than doubles over the full keying case because of its smaller cache size, but the Pentium MMX setup time is still comparable to the Pentium Pro setup time. However, almost all the extra time required is consumed merely in copying the code; the table does not reflect the fact that, once a single key has been initialized, future keys can be compiled at a cost of only a few hundred more clocks than full key schedule time.

5.1.2 Code and Data Size

As shown in Table 5.1, the code size for a fully optimized Twofish implementation on a Pentium Pro ranges from about 8450 bytes in assembler to 14100 bytes in Borland C. In assembler, the encryption and decryption routines are each about 2250 bytes in size; the remaining 4000 bytes of code are in the key scheduling routines, but about 2200 bytes of that total can be discarded if only 128-bit keys are needed. In Borland C, the encryption and decryption routines are each about 4500 bytes in length, and the key schedule routine is slightly less than 5000 bytes. Note that each routine fits easily within the code cache of a Pentium or a Pentium Pro. These sizes are larger than Blowfish but very similar to a fully optimized assembly language version of DES. Note that, with the exception of the zero keying option, the code sizes in the table are for fully unrolled implementations; in either C or assembler, it is possible to achieve significantly smaller code sizes using loops with round counters, at a cost in performance.

In addition to the code, there are about 4600 bytes of fixed tables for the MDS matrix, q0 and q1 required for key setup, and each key requires about 4300 bytes of key-dependent data tables for the full keying option. During encryption or decryption, these tables fit easily in the Pentium data cache. The other keying options use less data and table space, as discussed above.

5.1.3 Large Memory Implementations

For machines with sufficient RAM and a good memory cache subsystem, large precomputed tables can be used to reduce the key setup time for Twofish even further. For example, in compiled, full, or partial keying modes, the first two levels of q0 and q1 lookups with one key byte can be precomputed for all four S-boxes, requiring 256 Kbytes of table (four tables of 64 Kbytes each). This approach saves roughly 2000 clocks per key setup on the Pentium Pro in assembly language; details are shown in Table 5.2. For instance, the compiled mode key setup for 128-bit keys on a Pentium Pro can be reduced from 8700 clocks to 6500 clocks. Unfortunately, the savings on Pentium and Pentium MMX CPUs seems to depend on the performance of the L2 cache subsystem (which is included in the Pentium Pro and thus is more predictable); the gain seems to range from 500 clocks down to nothing. Implementing this “big table” version in C also leads to savings of about 1000 clocks per key setup on the Pentium Pro, depending on the quality of the compiler; again, Pentium performance gains are minimal.

Processor Lang Keying Option Code Size Clocks to Key Clocks to Encrypt
        128 192 256 128 192 256
PPro/II ASM Comp. 271,200 6500 9200 11900 258 258 258
PPro/II ASM Full 270,600 5300 8000 11000 315 315 315
PPro/II ASM Part. 272,900 2600 5300 8200 460 460 460
PPro/II MS C Full 273,300 7300 11200 15700 600 600 600
Table 5.2. Twofish Performance with Large Fixed Tables


Previous Table of Contents Next